{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "c3ee8d00",
      "metadata": {},
      "source": [
        "# How to split by character\n",
        "\n",
        ":::info Prerequisites\n",
        "\n",
        "This guide assumes familiarity with the following concepts:\n",
        "\n",
        "- [Text splitters](/docs/concepts/text_splitters)\n",
        "\n",
        ":::\n",
        "\n",
        "This is the simplest method for splitting text. This splits based on a given character sequence, which defaults to `\"\\n\\n\"`. Chunk length is measured by number of characters.\n",
        "\n",
        "1. How the text is split: by single character separator.\n",
        "2. How the chunk size is measured: by number of characters.\n",
        "\n",
        "To obtain the string content directly, use `.splitText()`.\n",
        "\n",
        "To create LangChain [Document](https://api.js.langchain.com/classes/langchain_core.documents.Document.html) objects (e.g., for use in downstream tasks), use `.createDocuments()`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "313fb032",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Document {\n",
            "  pageContent: \"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"... 839 more characters,\n",
            "  metadata: { loc: { lines: { from: 1, to: 17 } } }\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "import { CharacterTextSplitter } from \"@langchain/textsplitters\";\n",
        "import * as fs from \"node:fs\";\n",
        "\n",
        "// Load an example document\n",
        "const rawData = await fs.readFileSync(\"../../../../examples/state_of_the_union.txt\");\n",
        "const stateOfTheUnion = rawData.toString();\n",
        "\n",
        "const textSplitter = new CharacterTextSplitter({\n",
        "    separator: \"\\n\\n\",\n",
        "    chunkSize: 1000,\n",
        "    chunkOverlap: 200,\n",
        "});\n",
        "const texts = await textSplitter.createDocuments([stateOfTheUnion]);\n",
        "console.log(texts[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "dadcb9d6",
      "metadata": {},
      "source": [
        "You can also propagate metadata associated with each document to the output chunks:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "1affda60",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Document {\n",
            "  pageContent: \"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"... 839 more characters,\n",
            "  metadata: { document: 1, loc: { lines: { from: 1, to: 17 } } }\n",
            "}\n"
          ]
        }
      ],
      "source": [
        "const metadatas = [{ document: 1 }, { document: 2 }];\n",
        "\n",
        "const documents = await textSplitter.createDocuments(\n",
        "    [stateOfTheUnion, stateOfTheUnion], metadatas\n",
        ")\n",
        "\n",
        "console.log(documents[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ee080e12-6f44-4311-b1ef-302520a41d66",
      "metadata": {},
      "source": [
        "To obtain the string content directly, use `.splitText()`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "id": "2a830a9f",
      "metadata": {},
      "outputs": [
        {
          "data": {
            "text/plain": [
              "\u001b[32m\"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"\u001b[39m... 839 more characters"
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "const chunks = await textSplitter.splitText(stateOfTheUnion);\n",
        "\n",
        "chunks[0];"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "cd4dd67a",
      "metadata": {},
      "source": [
        "## Next steps\n",
        "\n",
        "You've now learned a method for splitting text by character.\n",
        "\n",
        "Next, check out a [more advanced way of splitting by character](/docs/how_to/recursive_text_splitter), or the [full tutorial on retrieval-augmented generation](/docs/tutorials/rag)."
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Deno",
      "language": "typescript",
      "name": "deno"
    },
    "language_info": {
      "file_extension": ".ts",
      "mimetype": "text/x.typescript",
      "name": "typescript",
      "nb_converter": "script",
      "pygments_lexer": "typescript",
      "version": "5.3.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}